refactor/metadata package #3919
Conversation
Adds packages/zarr-metadata, a sibling PyPI package that contains
spec-defined Zarr v2 and v3 metadata types as pure-typing artifacts
(TypedDicts, type aliases, Final string constants, NewType validators).
No runtime logic beyond minimal regex validators for spec-format-locked
strings (hex floats, base64 bytes, raw-bytes name).
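The NewType-plus-regex pattern described above can be sketched as follows. This is an illustrative sketch only: the function name, the `HexFloat16` type, and the exact regex are assumptions, not the package's actual API (the v3 spec allows float fill values to be hex-encoded as `0x` followed by the raw bytes in hex; for float16 that is four hex digits).

```python
import re
from typing import NewType

# Hypothetical sketch of a spec-format-locked string validator: a NewType
# whose only runtime logic is a regex check in its constructor function.
HexFloat16 = NewType("HexFloat16", str)

# "0x" followed by 4 hex digits -- the hex encoding of a 2-byte float16.
_HEX_FLOAT16 = re.compile(r"0x[0-9a-fA-F]{4}")

def hex_float16(value: str) -> HexFloat16:
    """Validate the spec-locked hex encoding, then wrap in the NewType."""
    if _HEX_FLOAT16.fullmatch(value) is None:
        raise ValueError(f"not a hex-encoded float16: {value!r}")
    return HexFloat16(value)
```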
Layout (anything imported from `zarr_metadata.v3.X` is a v3-spec
artifact; from `zarr_metadata.v2.X` is v2-spec; only true cross-version
primitives sit at the top level):
zarr_metadata/
├── common.py # JSON, NamedConfig, NamedRequiredConfig
├── __init__.py # ArrayMetadata, GroupMetadata unions
├── v2/
│ ├── array.py, group.py, consolidated.py, codec.py
└── v3/
├── array.py, group.py, consolidated.py
├── chunk_grid/ {regular, rectilinear}
├── chunk_key_encoding/ {default, v2}
├── codec/ {blosc, bytes, crc32c, gzip,
│ sharding, transpose, zstd}
└── data_type/ {bool, int8/16/32/64, uint8/16/32/64,
float16/32/64, complex64/128, bytes,
string, numpy_datetime64, numpy_timedelta64,
struct, raw}
zarr-python source is unchanged in this branch. zarr-metadata is shipped
as an independent package; a follow-up PR will adopt it inside zarr-python
once the package is published to PyPI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The structural tests constructed dict literals annotated with TypedDict
types and asserted that a key we just inserted came back out -- which
exercises Python's dict, not the type. The TypedDict has no runtime
shape check, so even a type-incompatible dict would pass. Pyright (in
CI) is what actually verifies the shapes.
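The point can be demonstrated with a stand-in TypedDict (the class here is illustrative, not one of the package's real types): TypedDict performs no runtime validation, so even a type-incompatible dict literal behaves like a plain dict, and only a static checker such as pyright flags the mismatch.

```python
from typing import TypedDict

# Hypothetical stand-in for one of the package's metadata TypedDicts.
class ArrayDoc(TypedDict):
    zarr_format: int
    shape: tuple[int, ...]

# Wrong types throughout -- pyright would reject this literal, but at
# runtime it is just a dict and no error is raised.
bogus: ArrayDoc = {"zarr_format": "three", "shape": None}  # type: ignore[typeddict-item]

# The insert-then-read assertion the old tests made passes regardless:
assert bogus["zarr_format"] == "three"
```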
Constant-equality tests (e.g. `assert NAME == "name"`) also tested
nothing.
Kept the validating-constructor tests for hex_float{16,32,64},
base64_bytes, and raw_bytes_dtype_name -- those exercise real regex
logic and catch real bugs.
Replaced test_imports.py with a single smoke test that confirms the
package loads and the top-level union types are reachable. Renamed
test_structural.py -> test_validators.py to match what it actually
tests.
Test count: 47 -> 6, ~750 lines -> 89.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report ✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main    #3919   +/- ##
=======================================
  Coverage   93.11%   93.11%
=======================================
  Files          85       85
  Lines       11365    11365
=======================================
  Hits        10582    10582
  Misses        783      783
if this gets merged, we would need the following changes in subsequent PRs:
zarr-metadata is a typing-only foundational package consumed by
libraries; supporting one Python version below the current minimum
widens the audience at minimal cost.
The only blocker was PEP 695 generic class syntax in `common.py`:
class NamedConfig[TName: str, TConfig: ...](TypedDict):
Rewritten to the PEP 484 `Generic[T]` form, which works on 3.11+.
The two affected classes carry `# noqa: UP046` comments since the
ruff rule pushes toward the newer syntax that we deliberately avoid.
All other modern features (PEP 604 `X | Y` unions at runtime,
`NotRequired`, `extra_items=` via typing_extensions, `ReadOnly`)
already work on 3.11.
Verified: package imports and all 6 tests pass on Python 3.11.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zarr-developers/python-core-devs I'd love feedback on this.
I really like the idea of having a public interface for metadata classes with minimal dependencies, and am fine with a sub-package as the mechanism for doing that. The order of operations seems a little off to me. In particular, I'm concerned about drift over the course of the follow-up PRs, and us realizing that we're not happy with the design of zarr-metadata while migrating the core library. The following sequence would make more sense to me:
As an aside, I'd recommend reserving the package names that you're hoping to use on PyPI in anticipation of these changes
The purpose of this PR is not to define metadata classes that zarr-python uses. It is narrowly scoped to define TypedDicts that model the contents of zarr metadata documents. So this package doesn't affect anything about the shape of metadata-modelling classes in zarr-python, except by defining the type of their JSON form. This package would only replace TypedDict / literal types that we currently define inside zarr-python. Allowing this package to own those types requires no behavioral changes from zarr-python. That's why my timeline is roughly "make this package stand alone (exist on PyPI), then make zarr-python depend on it"
What aspects of the code added here do you think might drift? The only changes I anticipate would be fixes for the following issues:
All of these can be handled gracefully by downstream packages.
Good idea; this is done under my own PyPI account. If this PR is approved I would transfer ownership to zarr-developers.
It seemed from your earlier request for feedback that this subpackage was intended to eventually be useful for zarr-python.
Most visibly, I'd like some more thought to be put into the names of the classes. It's inconsistent that some include Metadata and others don't.
The types and typeddicts in this package are useful for zarr python. These types define the JSON structure of the array metadata, including all the chunk grids, codecs, data types, etc. We use this information to model the JSON metadata documents we read and write. So the impact of this package would basically be upstreaming many typeddict and literal types that zarr-python defines, e.g. this one.
The array metadata, group metadata, and consolidated metadata TypedDicts are named
maxrjones left a comment
I like the direction here. I think we should have a very high bar for accuracy if we're promoting these as the canonical representation for public use.
For naming (recognizing this is bike-shedding), I'd prefer consistent suffix conventions:
- Codecs all end in Codec
- DataTypes all end in DataType
- Chunk grids all end in ChunkGrid
I think this functionality would be best tested with some round-trip property tests
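A minimal stdlib sketch of the suggested round-trip testing (a real suite would likely use a property-testing library such as hypothesis; the document generator here is illustrative and much simpler than the actual metadata shapes):

```python
import json
import random
import string

def random_metadata_doc(rng: random.Random) -> dict:
    """Generate a small random JSON-compatible document (illustrative shape)."""
    key = lambda: "".join(rng.choices(string.ascii_lowercase, k=5))
    return {
        "zarr_format": 3,
        "shape": [rng.randrange(1, 100) for _ in range(rng.randrange(1, 4))],
        "attributes": {key(): rng.random() for _ in range(3)},
    }

def test_round_trip(trials: int = 100) -> None:
    """Property: serializing then parsing any generated document is lossless."""
    rng = random.Random(0)
    for _ in range(trials):
        doc = random_metadata_doc(rng)
        assert json.loads(json.dumps(doc)) == doc

test_round_trip()
```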
See https://zarr-specs.readthedocs.io/en/latest/v2/v2.0.html#data-type-encoding
"""

fieldname: str
datatype: str
shape: NotRequired[tuple[int, ...]]
wouldn't the spec-faithful representation just have a list of lists? the V2 spec doesn't have JSON keys for fieldname or datatype
data_type: MetadataField
shape: tuple[int, ...]
chunk_grid: MetadataField
chunk_key_encoding: MetadataField
fill_value: JSON
codecs: tuple[MetadataField, ...]
attributes: NotRequired[Mapping[str, JSON]]
storage_transformers: NotRequired[tuple[MetadataField, ...]]
there are more restrictions on data_type, chunk_grid, chunk_key_encoding, and codecs than implied by MetadataField. This would be more useful if it were more strictly typed. In addition, MetadataField allowing a string shorthand is wrong in the chunk grid case.
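One way to express the stricter typing this comment asks for (class and alias names here are illustrative): give `chunk_grid` its own object-only union instead of the permissive `MetadataField`, since the bare-string shorthand is never valid for a chunk grid, while `data_type` genuinely does admit a plain name string in v3.

```python
from typing import Literal, TypedDict, Union

# Hypothetical per-field types, sketched under the assumption that
# MetadataField currently admits both a string shorthand and an object form.
class RegularChunkGridT(TypedDict):
    name: Literal["regular"]
    configuration: dict[str, object]  # {"chunk_shape": [...]} in the spec

ChunkGridT = RegularChunkGridT            # object form only -- no string shorthand
DataTypeT = Union[str, dict[str, object]]  # plain names remain valid here

grid: ChunkGridT = {"name": "regular", "configuration": {"chunk_shape": [8, 8]}}
```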
"""

name: Crc32cCodecName
configuration: NotRequired[dict[str, JSON]]
this typing allows erroneous metadata (e.g., "configuration": {"foo": 1})
clevel: int
shuffle: BloscShuffle
blocksize: int
typesize: int
typesize should be optional because it is not required with noshuffle
Codec = str | Mapping[str, object]
"""
The widest JSON shape that can specify a codec (v2 or v3).
why have a v2 or v3 codec under the v3 namespace?
from zarr_metadata.common import JSON, NamedConfig


class AllowedExtraField(TypedDict, extra_items=JSON):  # type: ignore[call-arg]
I'd slightly prefer OptionalExtension as the name here
"INT32_DTYPE_NAME",
"Int32DTypeName",
"Int32FillValue",
do we really need three exports for each data type/codec? Could we start more minimal and add more if they are needed?
DateTimeUnit = Literal[
    "Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs", "ns", "ps", "fs", "as", "generic"
]
"""Time unit codes used by numpy.datetime64."""
this should be declared once and shared between this and timedelta
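The shared declaration could look like this (module placement is the reviewer's suggestion, not current layout: both the datetime64 and timedelta64 data-type modules would import the one alias). `typing.get_args` also gives a free runtime membership check derived from the single definition.

```python
from typing import Literal, get_args

# One shared declaration, imported by both numpy_datetime64 and
# numpy_timedelta64 modules rather than duplicated in each.
DateTimeUnit = Literal[
    "Y", "M", "W", "D", "h", "m", "s", "ms", "us", "μs",
    "ns", "ps", "fs", "as", "generic",
]

def is_datetime_unit(value: str) -> bool:
    """Runtime membership check derived from the single Literal definition."""
    return value in get_args(DateTimeUnit)
```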
class NamedRequiredConfig(TypedDict, Generic[TName, TConfig]):  # noqa: UP046
    """
    Named-config envelope with required configuration.

    Generic with two parameters: name literal and configuration mapping.

    Uses the PEP 484 ``Generic[T]`` form rather than PEP 695 ``[T]`` syntax
    so the package supports Python 3.11.
    """

    name: ReadOnly[TName]
    configuration: ReadOnly[TConfig]
This PR creates a new subpackage called zarr-metadata just for JSON metadata. It's stored in packages/zarr_metadata. It contains TypedDict classes that model the spec-defined JSON forms for v2 array + group metadata and v3 array + group metadata, including data types, codecs, chunk key encodings, and chunk grids. I only included type definitions for metadata that has an external spec, so zarr-python will need to define some types internally for e.g. unspecified data types or codecs.

I would like to publish this subpackage to PyPI. These types are useful to any Python tool that works with Zarr data, even if that tool doesn't use zarr-python. It is also useful to zarr-python, because it means we can remove and resolve some lingering questions about publishing types.

If we adopted the changes here, adding a new data type / codec / chunk grid, etc. would require adding types to zarr-metadata, then adding the implementation in zarr-python that works with those types. We wouldn't need to do these two operations in the same PR, but I expect that would be the normal practice.

This change does add complexity to our publishing workflow: we need to ensure that zarr-metadata changes are published at or before new zarr-python releases. We should add some checks to ensure that this happens.

Docs for the new package are missing from this PR. I would handle that in a follow-up.
I would appreciate feedback at all levels, including the following topics:
closes #3355 and #3795